---
editor_options: 
  markdown: 
    wrap: 72
---

## Legislators talk less about the future as they age -- replication materials

This folder contains materials necessary to reproduce the results
contained in the above article and additional analyses contained in the
supplementary information.

The folder contains all the materials necessary to reproduce the results
*starting from the data on parliamentary speech*. This involves:

-   fine-tuning a large language model
-   making predictions for parliamentary corpora
-   aggregating these predictions to form country-specific datasets
-   estimating various country-specific models

The first two steps are particularly compute-intensive. We therefore
also include the datasets used to estimate these models, so that
researchers who are *only* interested in the quantities presented in the
paper can replicate the models without rerunning those steps.

Note that we include the code we used to fine-tune the model, but the
code used to generate predictions relies on the version of this
fine-tuned model archived on HuggingFace.

# Project structure

The project is divided into five folders. There is one folder,
`common/`, which contains the scripts necessary to fine-tune the large
language model, and scripts to combine the outputs of the other folders.
There are then four country-specific folders. Each folder contains the
following subfolders:

-   data/
    -   speech/
    -   aux/
-   working/
-   R/
-   python/
-   outputs/
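
As a quick sanity check before running anything, the layout above can be
verified with a few lines of Python. This is a minimal sketch, not part
of the replication code: the function name `missing_subfolders` is
hypothetical, and the folder you pass in (for example `can`) is whichever
country-specific folder you want to check.

```python
from pathlib import Path

# Subfolders each folder is expected to contain, taken from the list above.
EXPECTED_SUBFOLDERS = ["data/speech", "data/aux", "working", "R", "python", "outputs"]


def missing_subfolders(country_dir):
    """Return the expected subfolders that do not exist under country_dir.

    Hypothetical helper for illustration; it is not shipped with the project.
    """
    root = Path(country_dir)
    return [sub for sub in EXPECTED_SUBFOLDERS if not (root / sub).is_dir()]
```

Calling `missing_subfolders("can")` from the project root returns an
empty list when every expected subfolder is present.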

# Data requirements

The project relies on corpora containing the full text of parliamentary
speech. We provide scripts to download parliamentary speech for the UK.
We provide the data for Australia. We do not have permission to
redistribute the corpora produced by Beelen et al. (2017) or Herzog and
Mikhaylov (2017). Researchers will therefore need to:

-   download the full (tarred, bzip2-compressed) Canadian Hansard
    dataset from
    [lipad.ca](https://www.lipad.ca/media/lipadcsv-1.1.0.tar.bz2) and
    extract it to the `can/data/speech` folder
-   download the (tarred, gzipped) data on members of the Commons from
    [lipad.ca](https://www.lipad.ca/media/ca-members.tar.gz) and extract
    it to `can/data/aux`

and

-   download the original Dáil data from [Harvard
    Dataverse](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/6MZN76)
    and extract it to the `irl/data/speech` folder.
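
If you prefer to extract the Canadian archives from Python rather than
with a desktop tool, something like the sketch below should work. It
assumes the archives have already been downloaded to the current
directory under the file names that appear in the URLs above;
`extract_archive` is a hypothetical helper, not part of the replication
code.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, target_dir):
    """Extract a compressed tar archive into target_dir and list what landed there.

    Hypothetical helper for illustration only.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Mode "r:*" lets tarfile detect gzip or bzip2 compression by itself.
    with tarfile.open(archive_path, mode="r:*") as archive:
        archive.extractall(target)
    return sorted(p.name for p in target.iterdir())


# Example calls (assumed local file names, taken from the download URLs):
# extract_archive("lipadcsv-1.1.0.tar.bz2", "can/data/speech")
# extract_archive("ca-members.tar.gz", "can/data/aux")
```

The returned listing makes it easy to see whether the archive unpacked
into an extra top-level folder that then needs to be moved or renamed.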

# Data reuse

The four country-specific folders use different identifiers for the MPs.
In Canada and Ireland, the legislators are identified using the
identifiers used in the projects which created that data. In Australia,
the Senators are identified using their Parliamentary Library codes, as
used in the R package `ausPH`. In the UK, MPs are identified by their
Wikidata codes.

# Software requirements

The code relies on the R packages listed below and any dependencies they
may have. In brackets we give the version we used to produce these
estimates.

-   arrow (18.1.0.1)
-   data.table (1.16.4)
-   doParallel (1.0.17)
-   dplyr (1.1.4)
-   foreach (1.5.2)
-   furrr (0.3.1)
-   glue (1.8.0)
-   gratia (0.10.0)
-   here (1.0.1)
-   hrbrthemes (0.8.7)
-   jsonlite (2.0.0)
-   knitr (1.50)
-   marginaleffects (0.24.0)
-   mgcv (1.9.1)
-   mlr3measures (1.0.0)
-   modelsummary (2.2.0)
-   parallel (4.3.3)
-   rvest (1.0.4)
-   tidytext (0.4.2)
-   tidyverse (2.0.0)
-   xml2 (1.3.8)

All of these packages are available on CRAN, except `parallel`, which
is distributed with R itself.

The code relies on the Python packages listed below and any dependencies
they may have.

-   datasets (2.12.0)
-   pandas (1.5.3)
-   pyreadr (0.4.7)
-   torch (2.0.1)
-   transformers (4.29.2)
-   numpy (1.24.3)
-   evaluate (0.4.0)
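
To see how an existing Python environment compares with the versions
listed above, a short check along the following lines may help. It is a
sketch only: `compare_versions` is a hypothetical helper, and treating a
build suffix such as `+cu117` as irrelevant to the comparison is an
assumption made here for convenience.

```python
from importlib import metadata

# Versions listed above for the Python side of the project.
PINNED = {
    "datasets": "2.12.0",
    "pandas": "1.5.3",
    "pyreadr": "0.4.7",
    "torch": "2.0.1",
    "transformers": "4.29.2",
    "numpy": "1.24.3",
    "evaluate": "0.4.0",
}


def compare_versions(pinned):
    """Map each package name to (listed version, installed version or None)."""
    report = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (wanted, installed)
    return report


for name, (wanted, installed) in compare_versions(PINNED).items():
    # Drop any local build suffix (e.g. "+cu117") before comparing.
    same = installed is not None and installed.split("+")[0] == wanted
    print(f"{name}: listed {wanted}, installed {installed}, match={same}")
```

A mismatch does not necessarily mean the code will fail, only that the
environment differs from the one used to produce the estimates.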

Finally, the code uses the command line tool `rsync` to download files
for the UK. The same files can, of course, be downloaded manually.

# System requirements

This replication package is compute-intensive. This is particularly true
of generating predictions from the fine-tuned large language model. We
strongly recommend running that part of the code only on a system with
an NVIDIA GPU that is properly configured to work with PyTorch. If you
have more than 4 GB of *graphics card* memory, you should increase the
batch size in `common/python/03_predict_focus.py` from 12 to a larger
number (for example, 24 if you have 8 GB).
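
The two data points above (12 at 4 GB, 24 at 8 GB) suggest scaling the
batch size in proportion to GPU memory. The sketch below does exactly
that; linear scaling beyond 8 GB is an assumption rather than something
we have tested, and both function names are hypothetical. The optional
helper reads the memory of the first CUDA device via PyTorch and returns
`None` if PyTorch or a CUDA device is unavailable.

```python
def suggested_batch_size(gpu_memory_gb, base_batch=12, base_memory_gb=4):
    """Scale the batch size linearly with GPU memory (assumed rule of thumb)."""
    return max(1, int(base_batch * gpu_memory_gb / base_memory_gb))


def detected_gpu_memory_gb():
    """Return the first CUDA device's memory in whole GB, or None if unknown."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    total_bytes = torch.cuda.get_device_properties(0).total_memory
    # Cards report slightly less than their nominal size, so round to nearest GB.
    return round(total_bytes / 1024**3)


print(suggested_batch_size(4))  # 12, the value shipped in the script
print(suggested_batch_size(8))  # 24, the example given above
```

If the suggested value causes out-of-memory errors, fall back to a
smaller batch size; the results should not depend on it, only the
running time.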

Generally, the code will run faster if you have a faster CPU with
multiple threads and lots of memory. You can set the number of threads
for some of the modelling in `common/R/model_funcs.R`. The default is
twenty.

Because the scripts use the fine-tuned models stored on HuggingFace,
your machine will need internet access.

# Information about the system used to run this code

**x86**

This code was run on a personal computer running Ubuntu 24.04.2 LTS. R
version 4.3.3 and Python 3.11.3 were installed. The CPU was an AMD Ryzen
9 3900X 12-Core Processor running at 4.6 GHz. The GPU was an NVIDIA
GeForce GTX 1650 with 4 GB of memory. 64 GB of system memory was
available.

The graphics card used is an old graphics card (released 2019) which is
not particularly powerful. It falls below the minimum requirements for
major 2025 releases in PC gaming.

**Apple Silicon**

The code was also run on a 2023 Mac mini with an Apple M2 Pro chip and
16 GB of shared system memory. Python 3.11 and R version 4.5.0 were
installed.

# Running order

**Pre-run instructions**

Ensure that speech data is extracted to the speech folder for each
legislature. On some systems this can prove unintuitive. It may be
helpful to decompress the package, rename the resulting folder to
'speech', and replace the empty existing folder in the relevant
directory.

Cloud-based back-up applications such as Dropbox can interfere with the
running of the scripts when online-only or automatic cloud storage is
enabled. It is best to store the whole folder locally until the
replication is complete.

> **Linux**
>
> Each folder contains a shell script called `doall.sh` which will run
> all the code for that country. This code relies on the fact that the
> fine-tuned LLM is available from HuggingFace.

> **macOS**
>
> Each folder contains a shell script called `doall_macos.sh` which will
> run all the code for that country. To run the script in a macOS `zsh`
> terminal, type `sh`, add a space, drag the file `doall_macos.sh` into
> the terminal, and press Enter.

Once all the country-specific analyses have been run, you can run the
plotting code contained in `common/R`. This will produce the figures
found in the article.
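
For readers who would rather drive the country-specific scripts from one
place, the running order can be sketched in Python as below. Several
things here are assumptions: the helper names are hypothetical, the
scripts are invoked with `sh` on both platforms, and the list of country
folders must be supplied by you (only `can` and `irl` are named in this
README).

```python
import platform
import subprocess


def build_commands(country_dirs, system=None):
    """Return (folder, argv) pairs, picking the script name for the platform."""
    system = system or platform.system()
    # macOS reports itself as "Darwin"; everything else gets the Linux script.
    script = "doall_macos.sh" if system == "Darwin" else "doall.sh"
    return [(folder, ["sh", script]) for folder in country_dirs]


def run_all(country_dirs):
    """Run each country's script in turn, stopping at the first failure."""
    for folder, argv in build_commands(country_dirs):
        # cwd=folder runs the script from inside the country folder.
        subprocess.run(argv, cwd=folder, check=True)


# Example (not executed here): run_all(["can", "irl"])
```

The plotting code in `common/R` is left out of this sketch and still
needs to be run afterwards, as described above.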

# References

Beelen, K., Thijm, T. A., Cochrane, C., Halvemaan, K., Hirst, G.,
Kimmins, M., Lijbrink, S., Marx, M., Naderi, N., Rheault, L., et al.
(2017). Digitization of the Canadian parliamentary debates. Canadian
Journal of Political Science, 50(3), 849–864.

Herzog, A., & Mikhaylov, S. J. (2017). Database of parliamentary
speeches in Ireland, 1919–2013. 2017 International Conference on the
Frontiers and Advances in Data Science (FADS), 29–34.
